Our example concerns a large company that wants to understand why some of its best and most experienced employees are leaving prematurely. The company also wishes to predict which valuable employees will leave next.
We have two goals: first, we want to understand why valuable employees leave, and second, we want to predict who will leave next.
We therefore propose to work with the HR department to gather relevant data about the employees and to communicate the significant factors that could explain and predict employees' departures.
Unfortunately, managers didn't keep an organised record of why people left, but we can still find some explanations in the data set provided by the HR department.
For our 15 000 employees we know: satisfaction level, latest (yearly) evaluation, number of projects worked on, average monthly hours, time spent in the company (in years), work accidents (within the past 2 years), promotions within the past 5 years, department and salary.
Here are the first rows of the HR department's database (note that it doesn't take into account the people who were fired, transferred or hired in the past year):
## satisfaction_level last_evaluation number_project average_montly_hours
## 1 0.38 0.53 2 157
## 2 0.80 0.86 5 262
## 3 0.11 0.88 7 272
## 4 0.72 0.87 5 223
## 5 0.37 0.52 2 159
## 6 0.41 0.50 2 153
## time_spend_company Work_accident left promotion_last_5years sales salary
## 1 3 0 1 0 sales low
## 2 6 0 1 0 sales medium
## 3 4 0 1 0 sales medium
## 4 5 0 1 0 sales low
## 5 3 0 1 0 sales low
## 6 3 0 1 0 sales low
At this stage we want to understand the data that composes our Analytical Base Table (ABT) and assess where its quality might suffer.
The summary below describes each feature of our ABT through standard measures of central tendency and spread. For example, we can see that the attrition rate is about 24%, the average satisfaction level is around 62% and the average evaluation is around 71%. On average, people work on 3 to 4 projects a year and about 200 hours per month.
summary(hr)
## satisfaction_level last_evaluation number_project average_montly_hours
## Min. :0.0900 Min. :0.3600 Min. :2.000 Min. : 96.0
## 1st Qu.:0.4400 1st Qu.:0.5600 1st Qu.:3.000 1st Qu.:156.0
## Median :0.6400 Median :0.7200 Median :4.000 Median :200.0
## Mean :0.6128 Mean :0.7161 Mean :3.803 Mean :201.1
## 3rd Qu.:0.8200 3rd Qu.:0.8700 3rd Qu.:5.000 3rd Qu.:245.0
## Max. :1.0000 Max. :1.0000 Max. :7.000 Max. :310.0
##
## time_spend_company Work_accident left
## Min. : 2.000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 3.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 3.000 Median :0.0000 Median :0.0000
## Mean : 3.498 Mean :0.1446 Mean :0.2381
## 3rd Qu.: 4.000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :10.000 Max. :1.0000 Max. :1.0000
##
## promotion_last_5years sales salary
## Min. :0.00000 sales :4140 high :1237
## 1st Qu.:0.00000 technical :2720 low :7316
## Median :0.00000 support :2229 medium:6446
## Mean :0.02127 IT :1227
## 3rd Qu.:0.00000 product_mng: 902
## Max. :1.00000 marketing : 858
## (Other) :2923
This graph presents the correlations between the variables. The size of each bubble reflects the strength of the correlation, while the colour shows its direction (positive or negative).
library(dplyr)     # for %>% and select()
library(corrplot)  # for the correlation plot
HR_correlation <- hr %>% select(satisfaction_level:promotion_last_5years)
M <- cor(HR_correlation)
corrplot(M, method="circle")
On average, people who leave have a low satisfaction level, work more hours and didn't get promoted within the past five years.
cor(HR_correlation)
## satisfaction_level last_evaluation number_project
## satisfaction_level 1.00000000 0.105021214 -0.142969586
## last_evaluation 0.10502121 1.000000000 0.349332589
## number_project -0.14296959 0.349332589 1.000000000
## average_montly_hours -0.02004811 0.339741800 0.417210634
## time_spend_company -0.10086607 0.131590722 0.196785891
## Work_accident 0.05869724 -0.007104289 -0.004740548
## left -0.38837498 0.006567120 0.023787185
## promotion_last_5years 0.02560519 -0.008683768 -0.006063958
## average_montly_hours time_spend_company
## satisfaction_level -0.020048113 -0.100866073
## last_evaluation 0.339741800 0.131590722
## number_project 0.417210634 0.196785891
## average_montly_hours 1.000000000 0.127754910
## time_spend_company 0.127754910 1.000000000
## Work_accident -0.010142888 0.002120418
## left 0.071287179 0.144822175
## promotion_last_5years -0.003544414 0.067432925
## Work_accident left promotion_last_5years
## satisfaction_level 0.058697241 -0.38837498 0.025605186
## last_evaluation -0.007104289 0.00656712 -0.008683768
## number_project -0.004740548 0.02378719 -0.006063958
## average_montly_hours -0.010142888 0.07128718 -0.003544414
## time_spend_company 0.002120418 0.14482217 0.067432925
## Work_accident 1.000000000 -0.15462163 0.039245435
## left -0.154621634 1.00000000 -0.061788107
## promotion_last_5years 0.039245435 -0.06178811 1.000000000
Let's create a data frame containing only the people who have left the company, so we can visualise the distribution of each feature:
hr_hist <- hr %>% filter(left==1)
par(mfrow=c(1,3))
hist(hr_hist$satisfaction_level,col="#3090C7", main = "Satisfaction level")
hist(hr_hist$last_evaluation,col="#3090C7", main = "Last evaluation")
hist(hr_hist$average_montly_hours,col="#3090C7", main = "Average monthly hours")
We can see why we wouldn't want to retain everybody: some people don't work well, as their evaluations show. But clearly many good workers leave as well.
par(mfrow=c(1,2))
hist(hr_hist$Work_accident,col="#3090C7", main = "Work accident")
plot(hr_hist$salary,col="#3090C7", main = "Salary")
Of the 15 000 employees in our database, here is the number who have left:
hr_leaving_people <- hr %>% filter(left==1)
nrow(hr_leaving_people)
## [1] 3571
More problematic: here is the total number of employees who received an above-average evaluation, or spent at least four years in the company, or worked on more than 5 projects at the same time, and who still left the company. These are the people the company should have retained.
hr_good_leaving_people <- hr_leaving_people %>% filter(last_evaluation >= 0.70 | time_spend_company >= 4 | number_project > 5)
nrow(hr_good_leaving_people)
## [1] 2014
Let's now build a data table that applies the same "valuable employee" filter, but this time to the whole workforce rather than only the leavers, and see why these people tend to leave.
hr_good_leaving_people2 <- hr %>% filter(last_evaluation >= 0.70 | time_spend_company >= 4 | number_project > 5)
hr_good_people_select <- hr_good_leaving_people2 %>% select(satisfaction_level, number_project: promotion_last_5years)
M <- cor(hr_good_people_select)
corrplot(M, method="circle")
Here it's much clearer. On average, valuable employees who leave are not satisfied, work on many projects, spend many hours in the company each month and aren't promoted.
summary(hr_good_leaving_people2)
## satisfaction_level last_evaluation number_project average_montly_hours
## Min. :0.090 Min. :0.3600 Min. :2.000 Min. : 96.0
## 1st Qu.:0.490 1st Qu.:0.7300 1st Qu.:3.000 1st Qu.:171.0
## Median :0.680 Median :0.8300 Median :4.000 Median :218.0
## Mean :0.617 Mean :0.8015 Mean :4.159 Mean :211.8
## 3rd Qu.:0.830 3rd Qu.:0.9100 3rd Qu.:5.000 3rd Qu.:253.0
## Max. :1.000 Max. :1.0000 Max. :7.000 Max. :310.0
##
## time_spend_company Work_accident left
## Min. : 2.000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 3.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 4.000 Median :0.0000 Median :0.0000
## Mean : 3.916 Mean :0.1521 Mean :0.2061
## 3rd Qu.: 5.000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :10.000 Max. :1.0000 Max. :1.0000
##
## promotion_last_5years sales salary
## Min. :0.00000 sales :2628 high : 834
## 1st Qu.:0.00000 technical :1786 low :4671
## Median :0.00000 support :1466 medium:4267
## Mean :0.02384 IT : 808
## 3rd Qu.:0.00000 product_mng: 582
## Max. :1.00000 marketing : 561
## (Other) :1941
Now we want to predict which valuable employees will leave next.
Let's use the same filter as above to keep only the most valuable employees. Here is a summary of that data set.
hr_model <- hr %>% filter(last_evaluation >= 0.70 | time_spend_company >= 4 | number_project > 5)
summary(hr_model)
## satisfaction_level last_evaluation number_project average_montly_hours
## Min. :0.090 Min. :0.3600 Min. :2.000 Min. : 96.0
## 1st Qu.:0.490 1st Qu.:0.7300 1st Qu.:3.000 1st Qu.:171.0
## Median :0.680 Median :0.8300 Median :4.000 Median :218.0
## Mean :0.617 Mean :0.8015 Mean :4.159 Mean :211.8
## 3rd Qu.:0.830 3rd Qu.:0.9100 3rd Qu.:5.000 3rd Qu.:253.0
## Max. :1.000 Max. :1.0000 Max. :7.000 Max. :310.0
##
## time_spend_company Work_accident left
## Min. : 2.000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 3.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 4.000 Median :0.0000 Median :0.0000
## Mean : 3.916 Mean :0.1521 Mean :0.2061
## 3rd Qu.: 5.000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :10.000 Max. :1.0000 Max. :1.0000
##
## promotion_last_5years sales salary
## Min. :0.00000 sales :2628 high : 834
## 1st Qu.:0.00000 technical :1786 low :4671
## Median :0.00000 support :1466 medium:4267
## Mean :0.02384 IT : 808
## 3rd Qu.:0.00000 product_mng: 582
## Max. :1.00000 marketing : 561
## (Other) :1941
After setting up our cross-validation we build and compare different predictive models: the first uses a decision tree, the second naive Bayes, and the third boosted logistic regression.
# Set the target variable as a factor
hr_model$left <- as.factor(hr_model$left)
## install.packages("caret")
library("caret")
## Loading required package: lattice
# 5-fold cross-validation
# (note: the `repeats` argument only has an effect with method = "repeatedcv")
train_control<- trainControl(method="cv", number=5, repeats=3)
head(train_control)
## $method
## [1] "cv"
##
## $number
## [1] 5
##
## $repeats
## [1] 3
##
## $search
## [1] "grid"
##
## $p
## [1] 0.75
##
## $initialWindow
## NULL
library("rpart")
library("rpart.plot")
# train the model
rpartmodel<- train(left~., data=hr_model, trControl=train_control, method="rpart")
# make predictions
predictions<- predict(rpartmodel,hr_model)
hr_model_tree<- cbind(hr_model,predictions)
# summarize results
confusionMatrix<- confusionMatrix(hr_model_tree$predictions,hr_model_tree$left)
confusionMatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7591 276
## 1 167 1738
##
## Accuracy : 0.9547
## 95% CI : (0.9504, 0.9587)
## No Information Rate : 0.7939
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8586
## Mcnemar's Test P-Value : 2.878e-07
##
## Sensitivity : 0.9785
## Specificity : 0.8630
## Pos Pred Value : 0.9649
## Neg Pred Value : 0.9123
## Prevalence : 0.7939
## Detection Rate : 0.7768
## Detection Prevalence : 0.8051
## Balanced Accuracy : 0.9207
##
## 'Positive' Class : 0
##
# library("ROCR")
# hr_model_tree$predictions <- as.numeric(paste(hr_model_tree$predictions))
#
# perf.obj <- prediction(predictions=hr_model_tree$predictions, labels=hr_model_tree$left)
# # Get data for ROC curve
# roc.obj <- performance(perf.obj, measure="tpr", x.measure="fpr")
# plot(roc.obj,
# main="HR Attrition - ROC Curve",
# xlab="1 – Specificity: False Positive Rate",
# ylab="Sensitivity: True Positive Rate",
# col="blue")
# abline(0,1,col="grey")
## Loading required package: kknn
##
## Attaching package: 'kknn'
## The following object is masked from 'package:caret':
##
## contr.dummy
# train the model
e1071model2 <- train(left~., data=hr_model, trControl=train_control, method="nb")
## Loading required package: klaR
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# make predictions
predictions<- predict(e1071model2,hr_model)
e1071modelbinded <- cbind(hr_model,predictions)
# summarize results
confusionMatrix<- confusionMatrix(e1071modelbinded$predictions,e1071modelbinded$left)
confusionMatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7663 601
## 1 95 1413
##
## Accuracy : 0.9288
## 95% CI : (0.9235, 0.9338)
## No Information Rate : 0.7939
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.76
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9878
## Specificity : 0.7016
## Pos Pred Value : 0.9273
## Neg Pred Value : 0.9370
## Prevalence : 0.7939
## Detection Rate : 0.7842
## Detection Prevalence : 0.8457
## Balanced Accuracy : 0.8447
##
## 'Positive' Class : 0
##
# train the model
gmlmodel <- train(left~., data=hr_model, trControl=train_control, method="LogitBoost")
## Loading required package: caTools
# make predictions
predictions<- predict(gmlmodel,hr_model)
gmlmodelbinded <- cbind(hr_model,predictions)
# summarize results
confusionMatrix<- confusionMatrix(gmlmodelbinded$predictions,gmlmodelbinded$left)
confusionMatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7671 319
## 1 87 1695
##
## Accuracy : 0.9585
## 95% CI : (0.9543, 0.9623)
## No Information Rate : 0.7939
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8674
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9888
## Specificity : 0.8416
## Pos Pred Value : 0.9601
## Neg Pred Value : 0.9512
## Prevalence : 0.7939
## Detection Rate : 0.7850
## Detection Prevalence : 0.8176
## Balanced Accuracy : 0.9152
##
## 'Positive' Class : 0
##
# library("ROCR")
# gmlmodelbinded$predictions <- as.numeric(paste(gmlmodelbinded$predictions))
#
# perf.obj <- prediction(predictions=gmlmodelbinded$predictions, labels=gmlmodelbinded$left)
# # Get data for ROC curve
# roc.obj <- performance(perf.obj, measure="tpr", x.measure="fpr")
# plot(roc.obj,
# main="HR Attrition - ROC Curve",
# xlab="1 – Specificity: False Positive Rate",
# ylab="Sensitivity: True Positive Rate",
# col="blue")
# abline(0,1,col="grey")
The confusion matrices and accuracy figures of the different models show that their predictive power is similar and seems robust: about 95% accuracy, with a Kappa around 0.86 for the best two. We decide to keep the boosted logistic regression model to lay out actionable insights: it's a simple model and it gives the best results.
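As a sanity check, the accuracy and Kappa reported above can be recomputed by hand from the LogitBoost confusion-matrix counts. This is a self-contained sketch in base R, using only the four cell counts printed above:

```r
# Counts from the LogitBoost confusion matrix above
# (rows = prediction, columns = reference)
tp0 <- 7671; fp0 <- 319   # predicted 0: correct / wrong
fp1 <- 87;   tp1 <- 1695  # predicted 1: wrong / correct

n <- tp0 + fp0 + fp1 + tp1
accuracy <- (tp0 + tp1) / n

# Cohen's Kappa: observed agreement corrected for chance agreement,
# where chance agreement comes from the row and column margins
p_observed <- accuracy
p_chance <- ((tp0 + fp0) / n) * ((tp0 + fp1) / n) +
            ((fp1 + tp1) / n) * ((fp0 + tp1) / n)
kappa <- (p_observed - p_chance) / (1 - p_chance)

round(accuracy, 4)  # 0.9585, matching the reported Accuracy
round(kappa, 4)     # 0.8674, matching the reported Kappa
```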
Here is a plot that shows each employee's probability of leaving against their performance. We need to focus on the top right corner. To do that, we build a data table where we combine the probability of leaving estimated by the regression model with the performance score, which gives us the company's retention priorities.
set.seed(100)
# Keep some data to test again the final model
inTraining <- createDataPartition(hr_model$left, p = .75, list = FALSE)
training <- hr_model[ inTraining,]
testing <- hr_model[-inTraining,]
# Estimate the drivers of attrition
logreg = glm(left ~ ., family=binomial(logit), data=training)
# Make predictions on the out-of-sample data
probaToLeave=predict(logreg,newdata=testing,type="response")
# Structure the prediction output in a table
predattrition = data.frame(probaToLeave)
# Add a column to the predattrition dataframe containing the performance
predattrition$performance=testing$last_evaluation
plot(predattrition$probaToLeave,predattrition$performance)
Here we display the 300 employees the company should retain first. After grouping them by department we could email each manager a list of the valuable employees who might leave soon.
predattrition$priority=predattrition$performance*predattrition$probaToLeave
orderpredattrition=predattrition[order(predattrition$priority,decreasing = TRUE),]
orderpredattrition <- head(orderpredattrition, n=300)
library(DT)  # for the interactive table
datatable(orderpredattrition)
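The per-department grouping mentioned above is not shown in the code. Here is a minimal self-contained sketch of the idea in base R, using invented toy rows; in the real analysis the department would come from the `sales` column of `testing`:

```r
# Toy data mirroring the columns used in the analysis (values invented)
predattrition <- data.frame(
  department   = c("sales", "technical", "sales", "support"),
  probaToLeave = c(0.91, 0.78, 0.40, 0.85),
  performance  = c(0.95, 0.88, 0.99, 0.70)
)

# Priority score: high performers with a high probability of leaving come first
predattrition$priority <- predattrition$probaToLeave * predattrition$performance
ranked <- predattrition[order(predattrition$priority, decreasing = TRUE), ]

# One list of at-risk employees per department, ready to send to each manager
by_department <- split(ranked, ranked$department)
```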